Skip to content

refactor!: rename and rewrite Handle impls#220

Open
dssgabriel wants to merge 30 commits intokokkos:developfrom
dssgabriel:refactor/handles
Open

refactor!: rename and rewrite Handle impls#220
dssgabriel wants to merge 30 commits intokokkos:developfrom
dssgabriel:refactor/handles

Conversation

@dssgabriel
Copy link
Copy Markdown
Collaborator

@dssgabriel dssgabriel commented Mar 5, 2026

Description

Similar to the recent work on requests (#216), this PR attempts to rename and rewrite the implementations of Handles, now known as Communicators, in view of stabilizing the Kokkos Comm API for our first official release.

Additional context

Motivation
These changes attempt to stabilize Kokkos Comm's public API for "handles". In particular, it aims at:

  • Settling on a "correct" name for these objects that wrap a library-specific communicator handle and a Kokkos execution space
  • Removing any backend-dependent interfaces from their public API
  • Expanding the capabilities of these objects
  • Defining a set of mandatory supported operations
  • Unifying implementation strategies in the backend specializations (as much as possible)
  • Bringing the documentation up-to-date

Implementation notes

Unified interface across backends:
The central design constraint was that Communicator<Ex, MpiSpace> and Communicator<Ex, NcclSpace> must expose identical public interfaces, so that code written against the abstraction does not need to branch on the backend.

Factory functions instead of constructors:
Construction is intentionally not possible via public constructors because there are multiple ways to create communicators. Static "factory" member functions (from_raw, duplicate_from_raw, split_from_raw) were chosen for three reasons:

  1. They have meaningful names that make construction intent explicit at the call site;
  2. They can return std::optional to force callers to handle failure without exceptions; and
  3. They make it impossible to accidentally construct a Communicator in an invalid state.

A builder pattern was considered but rejected — with only three construction modes and minimal configuration, a builder would add object lifetime complexity for little real benefit, and in the long run, could promote backend-specific setter methods that would gradually erode the generic interface.

Non-owning from_raw:
from_raw never takes ownership of the raw handle it is given. If a user created a raw communicator, they opted out of the abstraction layer and retain responsibility for its lifetime. Allowing optional ownership transfer would reintroduce a boolean parameter at the call site and recreate the exact class of double-free and leak bugs the design is meant to prevent. The ownership table is simple and memorable: only duplicate and split produce owned communicators, because only they allocate a new backend handle internally.

Explicit copies, move-only by default:
The copy constructor and copy assignment operator are deleted. Duplicating a communicator is an explicit MPI/NCCL operation with observable side effects and a potential for failure; silently copying a Communicator as though it were a value type would hide that cost entirely. Users who need a copy call duplicate(), which makes the intent and the cost visible. Move semantics are implemented manually rather than defaulted: the move constructor uses std::exchange to null out the source handle and set owned_ to false, ensuring the moved-from object's destructor is always a no-op and preventing double-free.

Changes

  • Affected areas: all
  • Breaking change(s)? yes
  • Renamed Handle -> Communicator across the entire code base, including file names, type aliases, and documentation
    • After discussing with potential users during the 2026 HPSF Community Summit, this is the name that seemed to be the most well understood by the community
  • Rewrote both MPI and NCCL Communicator specializations with a unified interface
    • Added duplicate_from_raw and split_from_raw factory operations for both backends, each accepting either a raw backend handle. Also support duplicating/splitting from an existing KokkosComm::Communicator using duplicate and split member functions.
    • Added from_raw factory function, which borrows a raw backend handle without taking ownership
    • Unified accessor interface across both specializations: exec(), comm(), size(), rank()
    • Deleted copy constructor and copy assignment operator; use duplicate() for explicit copies
    • Implemented correct move constructor and move assignment operator
  • Refactored all existing tests to use the new Communicator API
  • Added a dedicated unit test suite for Communicator, covering factory functions, member functions, accessors, and move semantics for both MPI and NCCL backends
  • Rewritten documentation fully covering the new Communicator object
  • Updated clang-format config to avoid bin-packing arguments, improving readability of functions with long signatures (especially factory functions with attributes, default values, etc.)

Drive-by changes

  • Enabled previously forgotten broadcast and all-gather tests (files existed but weren't declared in CMake)

Checklist

  • Tests are up-to-date
  • Documentation is up-to-date

Simple wrapper of `int` with UDL for easy creation
Makes the code much easier to read for functions with attributes and
long type/param/tparam names.
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
@dssgabriel dssgabriel added this to the Version 0.3.0 milestone Mar 5, 2026
@dssgabriel dssgabriel self-assigned this Mar 5, 2026
@dssgabriel dssgabriel added E-help-wanted Call for participation: help is requested and/or extra attention is needed A-core Area: KokkosComm core library implementation E-hard Call for participation: hard - high experience level required A-mpi Area: KokkosComm MPI backend implementation A-nccl Area: KokkosComm NCCL backend implementation labels Mar 5, 2026
@dssgabriel dssgabriel marked this pull request as draft March 5, 2026 17:59
@dssgabriel dssgabriel changed the title refactor(core)!: rename and rewrite Handle impls refactor!: rename and rewrite Handle impls Mar 5, 2026
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
`do_iteration` was taking its variadic arguments by value, causing the
compiler to attempt a copy of any argument passed to it. This triggered
a compile error when passing a `Communicator`, whose copy constructor is
deleted by design.

Fixed by taking `Args&&...` and forwarding with
`std::forward<Args>(args)...`.
Correctly replace `exec_` with the passed in `exec` parameter.
Directly call the factory `split` instead of factory `duplicate`, since
member functions can access their rank without having to recompute it.
@dssgabriel
Copy link
Copy Markdown
Collaborator Author

I cannot test on a NCCL system right now, but everything passes for both Open MPI (5.0.9) and MPICH (5.0.0) for the MPI backend (non-CUDA enabled) on my machine.

For both MPI and NCCL specializations, covers:
- Factory functions (`from_raw`, `duplicate`, `split`);
- Accessors (`comm`, `size`, `rank`);
- Move semantics.
@dssgabriel
Copy link
Copy Markdown
Collaborator Author

@cwpearson Can you enable the SNL CI so I can fix any outstanding issues before making this ready for review?

Some drive-by updates:
- Removed mentions of `barrier` in the core API
- Fixed mentions of ~~`Req`~~ -> `Request`
@dssgabriel dssgabriel marked this pull request as ready for review March 12, 2026 08:31
Copy link
Copy Markdown
Member

@cedricchevalier19 cedricchevalier19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if we can get rid of the non owning aspect for the Communicator.
The code and the semantics may be simpler.

The other thing is the new Communicator should be used by all public functions of Kokkos Comm that use a communicator. So even the advanced functions in ::mpi::, should not take a MPI_Comm.

// Create a handle
KokkosComm::Handle<> handle; // Same as Handle<Execspace, CommSpace>
// Create a communicator
auto comm = KokkosComm::Communicator<Co, Ex>::duplicate(raw_comm_handle, exec_space);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For later, we should deduce the correct template parameters from the arguments. Or do we make a function that duplicates a Communicator.

Copy link
Copy Markdown
Collaborator Author

@dssgabriel dssgabriel Mar 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we cannot deduce without taking an MpiSpace tag as argument, since we cannot define CTAD-like rules for static member functions.
An alternative could be the free function you mention, but I don't know about that. The name would need to reflect that it operates on a communicator, which isn't as clean as having a static member function.

One easy improvement we can make is to default the template parameters of the Communicator:

template <
  CommunicationSpace Comm = DefaultCommunicationSpace,
  KokkosExecutionSpace Exec = Kokkos::DefaultExecSpace>
struct Communicator;

which allows for simpler "default" code:

auto comm = KokkosComm::Communicator<>::duplicate(raw_comm_handle, exec);

@dssgabriel
Copy link
Copy Markdown
Collaborator Author

dssgabriel commented Mar 17, 2026

I am wondering if we can get rid of the non-owning aspect for the Communicator.
The code and the semantics may be simpler.

TLDR: I think we should keep it for now.

I tend to agree, but this makes it harder for existing MPI-based codes to enter the Kokkos Comm "ecosystem". Namely, for libraries or apps that already manage their own MPI communicator (create/dup/split + free), having the from_raw static member function makes it very easy to tap into Kokkos Comm with minimal changes. As already discussed at length, having strong interoperability with MPI-native objects is one of our goals, and I think from_raw helps with that.

Some design suggestions, though:

  • Make it even simpler and not return a std::optional; what you pass is what you get. This also more clearly separates this function from duplicate and split, and "accentuates" the fact that we are not responsible for managing this communicator.
  • Mark it [[deprecated("Prefer creating Communicators with duplicate() or split() instead")]], so we can plan its removal from the public API at some point (e.g., once we have sorted out initialization and can do everything ourselves)?

The other thing is that the new Communicator should be used by all public functions of Kokkos Comm that use a communicator. So even the advanced functions in ::mpi should not take an MPI_Comm.

I agree. I have other PRs coming that clean up the p2p and colls interfaces (as part of #200) which are in line with this.

This could lead to a compilation error if both MPI and NCCL backends
were enabled at the same time, where `DefaultCommunicationSpace` would
be defined twice.
Now, we ensure it is only defined once. NCCL takes precedence over MPI
if it has been enabled.
Update tests and docs.
Remove `std::optional` from `from_raw` signature.
@dssgabriel
Copy link
Copy Markdown
Collaborator Author

Tested on my machine with MPI + CUDA and everything works fine.
I'll try to run on a multi-GPU system so I can test NCCL while we wait for SNL's CI.

Add unit tests for exec() accessor, self-move assignment safety,
duplicate independence chain, and sequential splits.
@dssgabriel
Copy link
Copy Markdown
Collaborator Author

I added a few more unit tests for Communicators and got them to run with the NCCL backend on a GPU system.

This is ready for review and merge @cedricchevalier19.
@cwpearson, can you enable the SNL CI, please?

Copy link
Copy Markdown
Member

@cedricchevalier19 cedricchevalier19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possible improvement: rename comm to mpi_comm in the internal MPI API.
Not blocking.

Comment on lines +22 to 28
auto req1 = KokkosComm::send(comm, data, dst_rank);

// Initiate a non-blocking send with a default handle
auto req2 = send(data, dest);
// Simulate a blocking send by waiting immediately
KokkosComm::send(comm, data, dst_rank).wait();

// Wait for the requests to complete (assuming a wait function exists)
// Wait for a request to complete
KokkosComm::wait(req1);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For later, @hmt23 can you confirm this is also correct from a MPI point-of-view?

@dssgabriel
Copy link
Copy Markdown
Collaborator Author

dssgabriel commented Mar 23, 2026

One possible improvement: rename comm to mpi_comm in the internal MPI API. Not blocking.

I think we can add an overload for it, but not outright rename it, or we will lose the "generic" interface between Communicator specializations (i.e., users will need to know which one they are using if they want to retrieve the raw comm handle)

@cedricchevalier19
Copy link
Copy Markdown
Member

One possible improvement: rename comm to mpi_comm in the internal MPI API. Not blocking.

I think we can add an overload for it, but not outright rename it, or we will lose the "generic" interface between Communicator specializations (i.e., users will need to know which one they are using if they want to retrieve the raw comm handle)

I was thinking about renaming the parameters, no change in the API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-core Area: KokkosComm core library implementation A-mpi Area: KokkosComm MPI backend implementation A-nccl Area: KokkosComm NCCL backend implementation E-hard Call for participation: hard - high experience level required E-help-wanted Call for participation: help is requested and/or extra attention is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants